A Treebank of Spanish and its Application to Parsing

Authors

  • Antonio Moreno
  • Ralph Grishman
  • Susana López
  • Fernando Sánchez-Figueroa
  • Satoshi Sekine
Abstract

This paper presents joint research between a Spanish team and an American one on the development and exploitation of a Spanish treebank. Such treebanks for other languages have proven valuable for the development of high-quality parsers and for a wide variety of language studies. However, when the project started, at the end of 1997, there was no syntactically annotated corpus for Spanish. This paper describes the design of such a treebank and its initial application to parser construction.

1. Constructing a Spanish treebank

1.1. Preliminary considerations

As there was no previous experience in building a syntactically annotated corpus for Spanish, the first effort necessarily consisted of writing a set of annotation guidelines. The starting point was the documentation available at that time, especially that of the Penn Treebank project (Marcus, Santorini and Marcinkiewicz, 1993; Bies et al., 1995), the EAGLES preliminary recommendations (EAGLES, 1996), and the Negra corpus (Skut et al., 1997).

Our experience in developing Spanish NLP systems told us that a pure phrase structure annotation (typical of the English treebanks) would not be enough for inducing relevant rules for Spanish. At the least, information about agreement and syntactic functions is necessary for Spanish, and we wanted to incorporate that information in our trees in the form of features.

The treebank has been created mostly by hand, although some automatic pre-tagging of the data is performed, as described below, to speed up treebank creation.
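The idea of enriching phrase structure nodes with agreement and syntactic-function features can be illustrated with a minimal sketch. The class name `Node`, the helper `agrees`, the feature names (`func`, `num`, `per`, `gen`) and the bracketed output format are all assumptions made for illustration; they are not the annotation scheme defined by the treebank's guidelines.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A constituent carrying a category label plus a feature dictionary (hypothetical layout)."""
    label: str
    features: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    word: Optional[str] = None  # set only on leaf nodes

    def bracketed(self) -> str:
        """Render the tree in a Penn-style bracketing, appending features to the label."""
        feats = "".join(f"-{k}={v}" for k, v in sorted(self.features.items()))
        head = f"{self.label}{feats}"
        if self.word is not None:
            return f"({head} {self.word})"
        inner = " ".join(child.bracketed() for child in self.children)
        return f"({head} {inner})"


def agrees(a: Node, b: Node, keys=("num", "per")) -> bool:
    """Check agreement on the features both nodes define; features missing on either side are ignored."""
    shared = [k for k in keys if k in a.features and k in b.features]
    return all(a.features[k] == b.features[k] for k in shared)


# "Los niños comen" ("The children eat"): the subject NP and the verb agree in number and person.
subject = Node(
    "NP",
    {"func": "SUBJ", "num": "pl", "per": "3"},
    [
        Node("DET", {"gen": "m", "num": "pl"}, word="Los"),
        Node("N", {"gen": "m", "num": "pl"}, word="niños"),
    ],
)
verb = Node("V", {"num": "pl", "per": "3"}, word="comen")
sentence = Node("S", children=[subject, Node("VP", children=[verb])])

print(sentence.bracketed())
print(agrees(subject, verb))
```

A plain phrase structure tree would record only `S`, `NP`, `VP` and so on; with the feature dictionary, a rule induced from the tree can also see that the subject and the verb share number and person.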

Similar resources

Extracting LTAG Grammars from a Spanish Treebank

Treebank grammars are known to help in building robust, wide-coverage statistical parsers that also obtain state-of-the-art accuracy. In this work, we present a system that extracts LTAG grammars for Spanish from a constituency-based Spanish treebank. We evaluate the extracted grammar in terms of its size, its coverage on unseen data and the performance of a supertagger trained on it. The s...

Interactive Predictive Parsing Framework for the Spanish Language

The Interactive Predictive Parsing (IPP) framework allows the construction of interactive tree annotation systems. These can help human annotators create error-free parse trees with little effort (compared to manually post-editing the trees obtained from a completely automatic parser). In this paper we adapt the IPP framework and the IPP-Ann annotation tool for parsing the Spanish lang...

Automatic Error Correction in a Syntactic Treebank Using Transformation-Based Machine Learning

A treebank is one of the most useful resources for supervised or semi-supervised learning in many NLP tasks such as speech recognition, spoken language systems, parsing and machine translation. Treebanks can be developed in different ways, which can generally be categorized into manual and statistical approaches. While the resulting treebank in each of these methods contains annotation errors,...

Exploring Morphosyntactic Annotation over a Spanish Corpus for Dependency Parsing

It has been observed that the inclusion of morphosyntactic information in dependency treebanks is crucial to obtain high results in dependency parsing for some languages. In this paper we explore in depth to what extent it is useful to include morphological features, and the impact of diverse morphosyntactic annotations on statistical dependency parsing of Spanish. For this, we give a detailed ...

Statistical Parsing of Spanish and Data Driven Lemmatization

Although parsing performance has greatly improved in recent years, grammar inference from treebanks for morphologically rich languages, especially from small treebanks, is still a challenging task. In this paper we investigate how state-of-the-art parsing performance can be achieved on Spanish, a language with a rich verbal morphology, with a non-lexicalized parser trained on a treebank co...

Publication date: 2000